Supervised Learning is currently the most widely used type of Machine Learning. Why? Because it's simple to understand, cheap and efficient to deploy into Production Environments, and it helps automate various business processes using a data-driven approach. This tutorial will cover the following learning objectives:
Linear Regression
Logistic Regression
Naive Bayes
Decision Trees
Random Forests
K-Nearest Neighbors
Support Vector Machine (SVM)
Linear Regression
Summary
Linear Regression is a linear approximation of a causal relationship between two or more variables. The basis for this algorithm is the slope-intercept formula: y = mx + b
The Dependent Variable, the "y" in the slope-intercept formula, is predicted using one or more Independent Variables, which correspond to the "mx" and "b" terms of the formula.
The Simple Linear Regression Model is used to predict a dependent variable based on a single independent variable. This model is represented by the following equation (for complete populations): y = β0 + β1X1 + ε
y represents the dependent variable that is being predicted, for example the price of a house.
β0 represents the Constant, or the value the dependent variable takes without any contribution from the independent variable (i.e., when the independent variable is zero).
x represents the independent variable that is being used to predict the dependent variable, for example the square footage of the house.
β1 represents the coefficient related to the independent variable. It shows that for every one-unit increase in the independent variable, the dependent variable increases by β1.
ε equals the measured error between the observed values of the dependent variable, and the predicted values of the dependent variable. This is used to offset the difference between the predicted values produced by your model and the actual values shown in your Training Data.
For incomplete populations, or samples, the following equation is used: ŷ = b0 + b1x1
A Causal Relationship is one where there is a clear relationship between the independent and dependent variables. Whether positive or negative, a perfect causal relationship is the equivalent of "if x increases by 1, then y increases by 1" or "if x increases by 1, then y decreases by 1".
NOTE: Understanding the concept of causal relationships is critical to becoming an effective Data Scientist or ML Engineer. This concept is typically called "Model Inference". We will discuss this topic in a later tutorial.
NOTE: Linear Regression models only work well with numeric features. Thus, if you have categorical features (e.g., color, gender), then you must exclude them or encode them numerically to get an effective model.
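As a quick illustration, here is a minimal sketch of fitting a Simple Linear Regression model with scikit-learn (a library choice not covered in this tutorial); the square-footage and price values are made up for the house-price example above.

```python
# Minimal sketch of Simple Linear Regression with scikit-learn.
# The square footage and price values are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy Training Data: square footage (independent variable) and price (dependent variable)
X = np.array([[800], [1200], [1500], [2000], [2400]])
y = np.array([150_000, 210_000, 255_000, 330_000, 390_000])

model = LinearRegression()
model.fit(X, y)

print("b0 (constant):", model.intercept_)
print("b1 (coefficient):", model.coef_[0])
print("Predicted price for 1,800 sq ft:", model.predict([[1800]])[0])
```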
Logistic Regression
Summary
Regression is a statistical method for modeling relationships between variables. It makes it possible to infer or predict a variable based on one or more other variables.
The difference between Linear Regression and Logistic Regression is that with Linear Regression, the dependent variable is a continuous value, whereas with Logistic Regression, the dependent variable is a binary value (can only be one of two possible values).
Logistic Regression is represented by the following equation: f(z) = 1 / (1 + e^(-z))
z represents the equation used for Linear Regression.
Since Logistic Regression is used to predict the probability of an outcome, the Linear Regression Equation is used to find how each independent variable contributes, positively or negatively, to the probability of the outcome.
When interpreting the results of a Logistic Regression Model, remember that the floating-point number the model assigns to each sample is the probability that the sample meets the positive condition. The coefficients describe how each independent variable shifts that probability: if the coefficient for an independent variable is 0.08, there is a positive relationship where each one-unit increase in that variable raises the log-odds of the positive condition by 0.08 (roughly an 8% increase in the odds).
NOTE: Just like with Linear Regression, Logistic Regression Models only work well with numeric features. Thus, if you have categorical features (e.g., color, gender), then you must exclude them or encode them numerically to get an effective model.
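Below is a minimal sketch of a binary Logistic Regression model using scikit-learn; the hours-studied feature and pass/fail labels are hypothetical and only meant to show how predicted probabilities and coefficients are read.

```python
# Minimal sketch of binary Logistic Regression with scikit-learn.
# The feature values and 0/1 labels below are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy Training Data: hours studied (independent variable) -> passed exam (1) or not (0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns the probability of each class; column 1 is the positive class
print("P(pass | 3.5 hours):", model.predict_proba([[3.5]])[0, 1])
print("Coefficient (effect on the log-odds):", model.coef_[0, 0])
```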
Naive Bayes
Summary
Naive Bayes is a common classification algorithm used by Data Scientists to determine the class of a target variable based on categorical independent variables. This algorithm can be used for either Binary Classification or Multiclass Classification.
The primary difference between Naive Bayes and Logistic Regression is that Logistic Regression requires numeric independent variables, whereas Naive Bayes can work with string-based, or categorical, independent variables.
Naive Bayes is based on Bayes Theorem, a theorem of conditional probability that can be described as an evidence-based trust theorem: the more evidence there is to support a condition, the more trust is given to it. Bayes Theorem is represented by the following equation: P(A|C) = P(C|A) * P(A) / P(C)
P(A) equals the Prior Probability, which describes the degree to which we believe the model accurately describes reality based on all of our prior information.
P(C|A) equals the Likelihood, which describes how well the model predicts the data.
P(C) equals the Normalizing Constant, which is the constant that makes the Posterior Density integrate to one (add up all probabilities to 1.0).
P(A|C) equals the Posterior Probability, which represents the degree to which we believe a given model accurately describes the situation given the Training Data.
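To make these four terms concrete, here is a small worked example of Bayes Theorem in Python; the spam-filter scenario and all of the probabilities in it are made up purely for illustration.

```python
# Worked example of Bayes Theorem with made-up numbers:
# A = "email is spam", C = "email contains the word 'free'".
p_a = 0.20          # P(A): prior probability that an email is spam
p_c_given_a = 0.60  # P(C|A): likelihood of seeing "free" in a spam email
p_c = 0.25          # P(C): normalizing constant, overall probability of seeing "free"

# Posterior Probability P(A|C) = P(C|A) * P(A) / P(C)
p_a_given_c = p_c_given_a * p_a / p_c
print(p_a_given_c)  # 0.48 -> 48% chance the email is spam given it contains "free"
```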
The reason Naive Bayes is "Naive" is that it assumes the features are not correlated with one another (conditionally independent), so the probability of seeing a combination of features in the same class is simply the product of the individual probabilities.
Naive Bayes Pros and Cons:
Pros:
It's fast and easy to predict a class.
It can provide better performance than Logistic Regression when conducting Binary Classification.
It performs well with categorical features compared to numerical features.
Cons:
If a class of the target variable appears in the Testing Data but was missing from the Training Data, the model will assign those samples a probability of 0. Although this can be fixed using smoothing techniques, it's just another step to work through.
Since Naive Bayes heavily relies on probabilities, it's bad at estimating classes based on feature correlations.
Naive Bayes assumes all features are independent (not correlated with any other feature); when features are in fact correlated, this assumption is violated and can bias the classification process.
NOTE: Naive Bayes is traditionally only used when real-time decisions need to be made, since Decision Trees, Random Forests, and Neural Networks tend to provide much better results with lower error rates.
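Here is a minimal sketch of Naive Bayes on categorical features using scikit-learn's CategoricalNB (one of several Naive Bayes variants); the color/size features and labels are hypothetical, and the categories are encoded as integers first because CategoricalNB expects numeric category codes.

```python
# Minimal sketch of Naive Bayes on categorical features using scikit-learn.
# The "color"/"size" features and labels are made up for illustration.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = np.array([
    ["red",   "small"],
    ["red",   "large"],
    ["green", "small"],
    ["green", "large"],
    ["red",   "small"],
])
y = np.array([1, 1, 0, 0, 1])  # made-up binary labels

# Encode the string categories as integer codes before fitting
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

model = CategoricalNB()
model.fit(X, y)

new_sample = encoder.transform([["green", "small"]])
print(model.predict(new_sample))        # predicted class
print(model.predict_proba(new_sample))  # class probabilities
```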
Decision Trees
Decision Tree Classification
Decision Tree Regression
Summary
Decision Trees are binary trees that recursively split the Training Data into separate classes until a prediction is returned. These are commonly used by Data Scientists and ML Engineers to conduct Binary and Multi-Class Classification.
There are three types of nodes within a Decision Tree: the Root Node is the base of the tree and creates the first split of the data based on a specific condition, Decision Nodes further split the data until a specific threshold is reached, and Leaf Nodes contain the final predictions that assign a class to the data points inside.
Pure Leaf Nodes occur when all the data points in a Leaf Node are assigned to a single class. When you have a large number of data points, this may not be possible.
Information Gain is the measure used by Decision Trees to find the optimal conditions for creating pure leaf nodes. The best possible split is one where the data is separated cleanly into the two classes. When working with more than two classes, the best possible scenario is to find the smallest number of conditions needed to create pure leaf nodes.
Entropy is the measure of information contained in a state. In the context of Decision Trees, the state would be the subset of data contained in each Decision Node. The higher the Entropy, the less clear information is present in the Decision Node (i.e., the more mixed its classes are).
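As a rough sketch, the entropy of a Decision Node's class distribution can be computed as -Σ p * log2(p); the class counts below are made up simply to show how purity changes the value.

```python
# Sketch of the entropy calculation for a Decision Node's class distribution.
# The class counts are made up; entropy = -sum(p * log2(p)) over the classes.
import math

def entropy(class_counts):
    total = sum(class_counts)
    probs = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([5, 5]))   # 1.0   -> maximum uncertainty for a 50/50 split
print(entropy([9, 1]))   # ~0.47 -> mostly one class, lower entropy
print(entropy([10, 0]))  # 0.0   -> a pure node
```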
When using a Decision Tree for Regression, also known as a Regression Tree, you'll run into leaf nodes that have more than one data point. Rather than assigning each data point the same value, we find the mean of the data points by summing them and dividing by the number of points in the leaf node. This will produce the prediction for the target variable.
Variance Reduction is a Regression technique whose goal is to reduce the amount of spread between the data points. When visualizing a Regression model, you'll see the "line of best fit", or trend line, which represents all the predicted values. The quality of this line is measured by finding either the Root Mean Squared Error (RMSE) or the Squared Correlation (r-squared). We'll discuss these metrics in a later tutorial.
In the context of Regression Trees, Variance Reduction is used to reduce the amount of space between the data points in each decision node. This is the equivalent of how Entropy is used to determine the ideal split condition for each Decision Node.
NOTE: Decision Trees are the best Supervised Learning algorithm for both Classification and Regression when working with categorical features (e.g., make and model of a car).
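Here is a minimal sketch of a Decision Tree classifier with scikit-learn using the entropy criterion discussed above; the car data is hypothetical, and the categorical "make" column is one-hot encoded because scikit-learn's trees expect numeric inputs.

```python
# Minimal sketch of a Decision Tree classifier with scikit-learn.
# The car data below is made up for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "make": ["toyota", "toyota", "ford", "ford", "honda", "honda"],
    "age_years": [2, 10, 3, 12, 1, 9],
    "is_reliable": [1, 0, 1, 0, 1, 0],  # made-up label
})

X = pd.get_dummies(data[["make", "age_years"]])  # one-hot encode the "make" column
y = data["is_reliable"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict(X.iloc[[0]]))  # predicted class for the first row
```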
Random Forests
Summary
Random Forests are a collection of Decision Trees that are used by Data Scientists and ML Engineers to reduce Overfitting (a concept we will discuss in a later tutorial).
When creating a Random Forest, you must specify the number of trees you'd like to fit to your Training Data. Bootstrapping is the concept of drawing random samples (with replacement) and random subsets of features from the training dataset to create a subset for each tree. Each subset will contain the same number of rows as the original Training Data, but NOT the same number of features.
When using a Random Forest for Classification, new data points are passed through each Decision Tree and the class with the most "votes" from each tree gets assigned. This helps your model understand the importance of each individual feature and how they interact with each other in the context of correlation.
Aggregation is the process of collecting the results from each Decision Tree created by the Random Forest and counting the number of occurrences of each class from all trees. The class with the highest number of occurrences gets assigned to the data point.
Random Forests are better than basic Decision Trees because they allow your model to give weights to features based on the impact they have on predictions. If a feature consistently drives the same predictions across trees, it's considered to be very important.
When deciding how many features to include in each Decision Tree, it's wise to use the log or square root of the number of features available. This has been found to be the best method to reduce observational bias.
When used with Regression problems, rather than aggregating the count of each class, the mean of all the trees' predictions is taken to produce the output prediction for each data point.
NOTE: Whenever you suspect bias in your data, or any form of uneven distribution, always go with Random Forests. This will reduce overfitting and focus more on important features rather than equal weights on all features.
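Below is a minimal sketch of a Random Forest with scikit-learn; the data is randomly generated for illustration, n_estimators sets the number of bootstrapped trees, and max_features="sqrt" mirrors the square-root rule mentioned above.

```python
# Minimal sketch of a Random Forest classifier with scikit-learn.
# The data is randomly generated purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 200 samples, 8 numeric features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # made-up binary label

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped Decision Trees
    max_features="sqrt",   # features considered at each split
    random_state=0,
)
forest.fit(X, y)

# feature_importances_ reflects how much each feature contributed to the predictions
print(forest.feature_importances_)
print(forest.predict(X[:3]))
```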
K-Nearest Neighbors
Summary
K-Nearest Neighbors is an algorithm used to classify a data point based on its proximity to other data points. This is a sister algorithm commonly used in conjunction with K-Means Clustering, an Unsupervised Learning algorithm explained in the next tutorial.
K refers to the number of nearest neighbors included in the cluster.
KNN is used explicitly for classification, thus your label must have distinct classes. When a new data point is introduced, such as in a Testing Dataset, you specify the "K" nearest neighbors to include in the classification and majority voting occurs, meaning the class with the most representation among those neighbors is assigned to that data point.
Euclidean distance is the default measure used for finding the nearest neighbors. This concept is covered in the Statistics Tutorial Series (Coming Soon).
KNN is very advantageous when working with large Training Datasets, requires no training phase since it simply runs the algorithm against the stored data at prediction time, and is well liked for its ability to learn complex data models easily.
KNN is NOT advantageous when you're working with a high number of features, as this will dramatically increase the amount of computing power required. It can also be difficult to determine the optimal number for "K".
NOTE: As mentioned in the video, "K" should ALWAYS be an odd number to avoid ties in the voting process.
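Here is a minimal sketch of K-Nearest Neighbors with scikit-learn; the two clusters of points are made up, K is set to an odd 3 per the note above, and Euclidean distance is the default metric.

```python
# Minimal sketch of K-Nearest Neighbors with scikit-learn.
# The two-feature data is made up for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # class 0 cluster
              [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])  # class 1 cluster
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3 (odd, to avoid voting ties)
knn.fit(X, y)

# A new point is classified by majority vote among its 3 nearest neighbors
print(knn.predict([[1.1, 1.0]]))  # -> class 0
print(knn.predict([[2.8, 3.1]]))  # -> class 1
```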
Support Vector Machines
Summary
A Support Vector Machine (SVM) is a classification algorithm best suited for extreme cases. In its most simple form, an SVM can only be used for binary classification.
A Hyperplane is the boundary drawn through the gap between the two classes that helps the algorithm differentiate between them. The position of the hyperplane is set by Support Vectors, which are the extreme data points in each class. These support vectors mark the edge of each class with respect to the maximum or minimum values of each feature (e.g., weight, height, age).
Rather than relying on the entire Training Dataset for understanding the relationships between features, SVMs rely on the Support Vectors for determining classes.
D+ is the shortest distance from the hyperplane to the closest positive point (first class), whereas D- is the shortest distance to the closest negative point (second class). The positive and negative terms come from the "binary" nature of the classification. In your Training Data, your label could be titled "is_dog" and be marked 1 if the sample is a dog, or 0 if the sample is a cat.
SVMs support multi-dimensional classifications where you use more than two features to predict the output. This is extremely difficult to visualize, especially if you have more than four features. Could you imagine a 7-Dimensional Scatter Plot?
Kernels allow you to work with non-linear data without having to use exponentially more computing power. There are various types of kernels, but the basic idea behind them is you input your Training Data, specify how you want it to be transformed, and it returns the dot product of each data point to make it possible to create a clear Hyperplane.
The dot product is an algebraic operation that takes two equal-length sequences of numbers (such as data points in a Training Dataset) and returns a single number.
Common Kernel Types include the following:
Polynomial Kernel
Radial Basis Function (RBF) Kernel
Sigmoid Kernel
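As a rough sketch, the kernel types listed above can be compared with scikit-learn's SVC on a made-up non-linear dataset; the moon-shaped data and the accuracy comparison are purely illustrative.

```python
# Minimal sketch comparing SVM kernels with scikit-learn's SVC.
# The moon-shaped toy data is generated just to show a non-linear boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    print(kernel, "accuracy:", round(model.score(X_test, y_test), 3))
```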
SVM Advantages:
SVMs are highly effective in high-dimensional spaces. If you have a high number of features, SVMs should be strongly considered.
Different Kernel Functions are available for various decision functions. This makes SVMs very flexible with both linear and non-linear Training Datasets.
SVM Disadvantages:
SVMs perform poorly when the number of features exceeds the number of samples (a rare situation when working with large datasets).
SVMs don't provide probability metrics for their classification assignments, unlike Naive Bayes.